Week 5 Final Project

For the final project, you will identify an Unsupervised Learning problem to perform EDA and model analysis.

Imports
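The original import cell is not shown in this write-up; a typical set of imports for the analysis below might look like the following sketch (the exact library choices are assumptions):

```python
# Core data-science stack assumed for this notebook.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# scikit-learn pieces used in the later sections.
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
```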

Gather data, determine the method of data collection and provenance of the data

The dataset for this project comes from Kaggle https://www.kaggle.com/datasets/sansuthi/dry-bean-dataset?select=Dry_Bean.csv and is described as follows:

Seven different types of dry beans were used in this research, taking into account the features such as form, shape, type, and structure by the market situation. A computer vision system was developed to distinguish seven different registered varieties of dry beans with similar features in order to obtain uniform seed classification. For the classification model, images of 13,611 grains of 7 different registered dry beans were taken with a high-resolution camera. Bean images obtained by computer vision system were subjected to segmentation and feature extraction stages, and a total of 16 features; 12 dimensions and 4 shape forms, were obtained from the grains.

Attribute Information:

Area (A): The area of a bean zone and the number of pixels within its boundaries.
Perimeter (P): Bean circumference is defined as the length of its border.
Major axis length (L): The distance between the ends of the longest line that can be drawn from a bean.
Minor axis length (l): The longest line that can be drawn from the bean while standing perpendicular to the main axis.
Aspect ratio (K): Defines the relationship between L and l.
Eccentricity (Ec): Eccentricity of the ellipse having the same moments as the region.
Convex area (C): Number of pixels in the smallest convex polygon that can contain the area of a bean seed.
Equivalent diameter (Ed): The diameter of a circle having the same area as a bean seed area.
Extent (Ex): The ratio of the pixels in the bounding box to the bean area.
Solidity (S): Also known as convexity. The ratio of the pixels in the convex shell to those found in beans.
Roundness (R): Calculated with the following formula: (4piA)/(P^2)
Compactness (CO): Measures the roundness of an object: Ed/L
ShapeFactor1 (SF1)
ShapeFactor2 (SF2)
ShapeFactor3 (SF3)
ShapeFactor4 (SF4)

The actual class attribute is also supplied with this dataset. While this will not be used during the unsupervised analysis, it will be useful for testing the accuracy of the unsupervised models against the actual known classifications.
Class (Seker, Barbunya, Bombay, Cali, Dermosan, Horoz and Sira)

Identify an Unsupervised Learning Problem (6 points)

During this course I was particularly interested in the various clustering methods that were presented, and I would like to gain a little more practice/experience with clustering using this dataset.

The unsupervised learning problem I would like to address is "how well will different clustering algorithms and techniques identify the 7 known dry bean types within the data?"

Import dataset
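The loading cell is not shown; a sketch of this step follows. The file path is an assumption (wherever the Kaggle CSV was downloaded), and a tiny inline stand-in is used when the file is absent so the sketch still runs:

```python
import pandas as pd
from pathlib import Path

CSV = Path("Dry_Bean.csv")  # assumed local path to the Kaggle download
if CSV.exists():
    df = pd.read_csv(CSV)   # real file: 13,611 rows, 16 features + Class
else:
    # Tiny illustrative stand-in (values are placeholders, not real rows).
    df = pd.DataFrame({
        "Area": [28395, 28734, 29380],
        "Perimeter": [610.291, 638.018, 624.110],
        "Class": ["SEKER", "SEKER", "SEKER"],
    })

print(df.shape)
```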

Exploratory Data Analysis (EDA) - Inspect, Visualize, and Clean the Data (26 points)

Inspect data
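A minimal sketch of the inspection step, using a small synthetic stand-in frame (the real `df` comes from Dry_Bean.csv, and the column names here are illustrative):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the bean measurements.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 3)),
                  columns=["Area", "Perimeter", "Eccentricity"])

df.info()                 # dtypes and non-null counts per column
print(df.describe())      # summary statistics per feature
print(df.isnull().sum())  # missing values per column (0 here)
```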

Visualise data
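A sketch of the visualisation step on synthetic stand-in data (column names are illustrative); histograms show per-feature skew and the scatter matrix gives the pair-plots discussed below:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(200, 4)),
                  columns=["Area", "Perimeter", "AspectRatio", "Roundness"])

df.hist(bins=30, figsize=(10, 6))                  # per-feature distributions
pd.plotting.scatter_matrix(df, figsize=(10, 10))   # pair plots to eyeball correlation
plt.show()
```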

From the graphs above, there are no obvious problems or issues with the data. While some of the fields are left- or right-skewed, there are no extreme outliers.

From the table above, there are no missing values. This is important to check, as many machine learning algorithms (including PCA, used below!) behave badly or incorrectly with missing values.
If there were missing values I would consider either deleting those rows, or imputing the missing values using the average of the k-nearest neighbours.
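The k-nearest-neighbour imputation mentioned above could be sketched with scikit-learn's `KNNImputer` (the tiny array here is purely illustrative):

```python
import numpy as np
from sklearn.impute import KNNImputer

# One missing value in the first column.
X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [3.0, 6.0]])

# Fill the NaN with the mean of the 2 nearest neighbours
# (distance computed on the features that are present).
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
print(X_filled)  # the NaN becomes mean(1.0, 3.0) = 2.0
```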

The pair-plots above clearly show which pairs of features are highly correlated: these appear as straight or curved lines. Ideally each pair-plot should look like a "cloud", with no obvious pattern of correlation or similarity.

Feature Reduction (PCA)

Given the very high correlation between some of the features, it makes sense to reduce the number of features before continuing with any data mining.

Principal Component Analysis (PCA) is a good choice, and it will be interesting to see how many significant features we end up with, down from the initial 16. It is particularly handy to get down to 2 principal components, as these are easily plotted on a scatter plot and will (hopefully) reveal clusters visually.

Standardize the Data
PCA is affected by scale, so you need to scale the features in your data before applying PCA. Use StandardScaler to standardize the dataset's features onto unit scale (mean = 0 and variance = 1), which is a requirement for the optimal performance of many machine learning algorithms.
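A minimal sketch of the standardization step on synthetic stand-in data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
X = rng.normal(loc=5.0, scale=3.0, size=(100, 4))  # stand-in features

# After scaling, each column has mean ~0 and standard deviation ~1.
X_scaled = StandardScaler().fit_transform(X)
print(X_scaled.mean(axis=0).round(6))
print(X_scaled.std(axis=0).round(6))
```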

Perform PCA.
Set PCA to find only enough Principal Components to explain 99% of the variance
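Passing a float to `n_components` tells scikit-learn's PCA to keep just enough components to explain that fraction of the variance. A sketch on synthetic data with deliberately redundant features (mimicking the correlated bean measurements):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
base = rng.normal(size=(500, 4))                     # 4 latent dimensions
X = np.hstack([base, base @ rng.normal(size=(4, 12))])  # 16 correlated features
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=0.99)  # keep enough PCs for 99% of the variance
X_pca = pca.fit_transform(X_scaled)
print(X_pca.shape[1], pca.explained_variance_ratio_.sum())
```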

From the output above, the PCA process has managed to reduce the number of features from 16 down to 7, and still be able to account for 99% of the variation in the data.

The charts above show the loading of the original features coefficients.
The most significant principal component, PC0, seems to use a relatively even weighting of all 16 features. This would suggest that no single feature has a dominant influence on the class variable.
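The loadings come from PCA's `components_` attribute, whose rows give each principal component's weights on the original features. A small sketch (feature names are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 5))
cols = ["Area", "Perimeter", "AspectRatio", "Roundness", "Solidity"]

pca = PCA().fit(X)

# Row i of components_ = loadings of PC i on the original features.
loadings = pd.DataFrame(pca.components_, columns=cols,
                        index=[f"PC{i}" for i in range(len(cols))])
print(loadings.round(2))
```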

Just as a safety check, the charts above confirm that 7 principal components are indeed sufficient to explain 99% of the variance in the data.
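A cumulative explained-variance chart like the one above could be sketched as follows (synthetic stand-in data):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.default_rng(5)
X = rng.normal(size=(300, 8))
pca = PCA().fit(X)

# Cumulative share of variance explained as components are added.
cumvar = np.cumsum(pca.explained_variance_ratio_)
plt.plot(range(1, len(cumvar) + 1), cumvar, marker="o")
plt.axhline(0.99, linestyle="--")
plt.xlabel("Number of components")
plt.ylabel("Cumulative explained variance")
```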

The pair plot above of the 7 principal components looks good, in that the components appear as uncorrelated clouds of points with no obvious correlations between themselves.
Good job PCA!

Perform Analysis Using Unsupervised Learning Models of your Choice, Present Discussion, and Conclusions

Now that the feature reduction is complete, we are ready to apply one or more unsupervised machine learning algorithms to these principal components.

Simple K-Means Clustering

The first choice of model will be simple k-means clustering, where the optimal number of clusters is unknown. Let's loop over a range of cluster counts and use the silhouette coefficient to select the best one. Hopefully this will be 7!
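The loop described above might be sketched like this, using synthetic blobs as a stand-in for the principal components (here the true number of clusters is 3, so the silhouette peak lands at 3):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Stand-in data with 3 well-separated clusters.
X, _ = make_blobs(n_samples=500,
                  centers=[[0, 0], [10, 10], [0, 10]],
                  cluster_std=1.0, random_state=0)

# Fit k-means for a range of k and score each clustering.
scores = {}
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k)  # 3 for this stand-in data
```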

Well, that did not go well at all! The chart above gives no indication of 7 clusters in the dataset; it suggests more like 2 or 3.
We got a hint from PCA that this might happen, as the first principal component used almost all of the original 16 features with very similar weightings.

Hierarchical Clustering

Let's try a hierarchical clustering model and see if it fares any better.

As explained in the course lecture, the choice of affinity and linkage methods can greatly affect the outcome of the hierarchical clustering algorithm. The code below tries a variety of different combinations on the principal components and displays the dendrograms.
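A sketch of trying several linkage methods and drawing the dendrograms, using SciPy on synthetic stand-in data (the distance metric and method names shown are just a subset of the combinations one might try):

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=60, centers=[[0, 0], [10, 10]], random_state=0)

# One dendrogram per linkage method ("ward" implies euclidean distance).
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, method in zip(axes, ["ward", "complete", "average"]):
    Z = linkage(X, method=method)
    dendrogram(Z, ax=ax, no_labels=True)
    ax.set_title(method)
```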

Conclusion

Well this is awkward - none of the various combinations of parameters give the slightest hint that there may be in fact 7 groups in the dataset!
I suppose the conclusion to be drawn from this is that it is actually quite difficult to tell the seven types of dry bean apart from the 16 measurements, and that unsupervised learning techniques such as k-means clustering and hierarchical clustering are simply not able to discover a pattern that effectively distinguishes them. In other words, for this dry bean dataset a supervised model is required.

Supervised Multiclass Logistic Regression

As a final quick check, let's apply a supervised learning algorithm to the dataset and confirm that it is indeed able to learn an accurate model.
Given that the data is all numeric, and there are 7 different class values, I will use the multi-class logistic regression algorithm.
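A sketch of the supervised check, using scikit-learn's built-in iris dataset as a stand-in for the bean data (numeric features, multiple classes):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Stand-in multiclass data (3 classes; the bean data has 7).
X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0, stratify=y)

# Multinomial logistic regression; hold-out accuracy checks learnability.
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(clf.score(X_te, y_te))
```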